Accelerate coverage processing with hot-path vectorization #126
base: master
Conversation
There seems to be a bug in the code.
Also, I did not notice an impact on execs_done when run for the same time. However, that could be because of extra code paths due to falsely detected variable paths.
@vanhauser-thc Thanks for the note! My own experimental code is messy, and I had a difficult time organizing it for this PR. The instability bug is now fixed. As for "execs_done", can you please show your experiment setup? In my case, I ran the vectorized version and the baseline version for 10s with the UI disabled. The overall speedup ranges from 5% to 20% with AVX2 enabled. More speedup can be observed if the coverage region is sparse or AVX512 is enabled. The rationale behind the short execution time is to reduce the randomness introduced by other parts of AFL. Although "execs_done" is an intuitive and conclusive metric, it is determined by (a) the execution speed of the target program and (b) AFL itself. In my observation, the vectorized version is always faster in the beginning, so it discovers more seeds than the baseline version. However, because the initial seeds are simple, the newly discovered seeds are usually slower to execute. Therefore, more time is spent in the target program, and the vectorized version can appear much slower on some specific seeds. Consequently, I prefer micro-benchmarking because it isolates this noise.
I can confirm this fixes the stability issue.
Thanks!
@jonathanmetzman No hurry for landing it -- have a good time on your holidays 😄 Looking forward to your review!
@hghwng I added your patch to afl++, although I had to make several adjustments as we have dynamically sized trace maps. It will be merged in the next few days. I did a FuzzBench run https://www.fuzzbench.com/reports/experimental/2020-12-18/index.html which tested it (among 2 other things), and it looks like it slightly improved overall as well. Not sure how this performs on ARM systems though.
@vanhauser-thc Thanks for the comprehensive evaluation on FuzzBench! As for the performance on ARM systems, I cannot give you a definite number due to the lack of hardware. But from a theoretical perspective, I think the performance gain persists. My patch is more than vectorization. It uses a one-pass design for most cases, removing the redundant computation of the original two-pass algorithm. As the "NoSIMD" vs "Vanilla" data shows, the performance gain can still be observed even without SIMD support.
@mboehme Big thanks for your insightful comments! I've refined the documentation (see ebedb33) following your suggestion. As for the implementation: as Compiler Explorer shows, the generated assembly is identical for both clang and gcc under O3. In fact, no memcpy calls are emitted by either clang or gcc even without enabling any optimization. What's more interesting, with Compiler Explorer's default configuration (gcc -O2), llvm-mca estimates that the memcpy version only requires 1.61 cycles per iteration, while the pointer version requires 2.09 cycles. Quick Bench also confirms a ~1% performance gain for the memcpy version. The memcpy version is more optimization-friendly. Modern compilers replace libc's memcpy with their own internal versions for optimization opportunities (clang, gcc). In this case, the fixed-size memcpy invocations are easily replaced with direct memory references and optimized with mem2reg passes. However, the pointer-based version has side effects and requires pointer analysis for the compiler to optimize correctly. For example, the compiler must know that … Personally, I like the memcpy version because it looks pretty :-p
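For context, here is a minimal sketch of the two load styles being compared in this thread; the helper names are illustrative only and are not the PR's code:

```c
#include <stdint.h>
#include <string.h>

/* Pointer-cast version: reads a counter word by reinterpreting the buffer.
   Relies on alignment and strict-aliasing assumptions. */
static inline uint64_t load_word_cast(const uint8_t *mem) {
  return *(const uint64_t *)mem;
}

/* memcpy version: the fixed-size copy is lowered to a plain load by modern
   compilers, while staying well-defined regardless of alignment. */
static inline uint64_t load_word_memcpy(const uint8_t *mem) {
  uint64_t word;
  memcpy(&word, mem, sizeof(word));
  return word;
}
```

On current clang and gcc at -O2/-O3, both variants compile down to a single load, which matches the Compiler Explorer observation above.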
Interesting! I changed your four casts to a single cast and can confirm for … Without optimizations (…)
Yes, memcpy is inlined unless compiled otherwise. However, some people are compiling with old gcc versions, or even non-gcc compilers, e.g. Intel's icc, so that it is always inlined is not a given. IMHO the memcpy is fine because we mainly want to cater to people using the most current compilers, while making sure it still works for everyone, even on outdated systems. However, on outdated systems people should not expect to have top performance anyway.
gcc is so over-optimized that simple changes can have a huge impact on the assembly generated.
Can also confirm that the performance of the naive and memcpy versions for -O3 -- from the oldest versions of Clang (3.8) and GCC (5.5) that are available on Quick Bench -- is equivalent.
@vanhauser-thc The fact that security checks can be optimized out is especially interesting! Can you show an example? @mboehme Do you think I can mark these memcpy-related comments as resolved? BTW, I speculate that the performance impact of reordering can be explained by the changed code layout (e.g. the 64-byte window of the Decode Stream Buffer). When doing microbenchmarks, it's always difficult to explain the exact result unless you get down to micro-architectural analysis.
Sure. Fine with me. I like your PR. I'm wondering: You are still classifying all counts, right? If there is no difference, during skimming, you classify all counts. If there is a difference, during skimming, you classify some counts, fall back "to the old logic" and call …
Thank you! I also like your work on improving the directed fuzzing algorithms, which accelerates fuzzing from another interesting and important perspective.
Classifying the counters first still needs two passes. Even with skimming enabled, the second pass still requires re-reading the trace bits. Because memory access is much slower than register access (which this PR relies on), I think classifying the counts first can be slower. As for the extra work performed by this PR, the acceleration in the hot path outweighs the cost of the fallback in general, as the FuzzBench result demonstrates.
This PR results in segfaults when -march=native is used on Intel Xeon CPUs. Super weird. I don't fault this code but the optimization. It happens with both clang and gcc though.
@vanhauser-thc I speculate that the alignment (64 bytes) requirement is not satisfied. I'm not familiar with how AFL++ allocates trace bits, but please ensure the data layout with compiler hints and APIs such as …
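For illustration only (the comment above is truncated, so the exact APIs the author had in mind are unknown), two common ways to request 64-byte alignment in C are a static alignment attribute and C11's aligned_alloc:

```c
#include <stdlib.h>

/* A statically allocated map with a 64-byte alignment hint (GCC/Clang attribute). */
static unsigned char trace_bits_static[1 << 16] __attribute__((aligned(64)));

/* A dynamically allocated map; aligned_alloc (C11) requires the size to be a
   multiple of the alignment. */
static unsigned char *alloc_trace_bits(size_t map_size) {
  return aligned_alloc(64, map_size);
}
```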
@hghwng I thought it was 32 bytes, but if you say 64, then this explains why it fails.
Summary
afl-fuzz spends lots of time processing coverage. Because most of the counters processed are zero and most of the executions do not need to be saved, most logic performed by AFL is redundant. In my measurements, the coverage pipeline can occupy up to 90% of afl-fuzz's user space time.
This pull request accelerates the most time-consuming part of AFL with a 3-tiered design. On coverage files collected from libxml2, the modified algorithm achieves a speedup of 54%. With AVX2 and AVX512, the speedup increases to 388% and 603% respectively.
See experiments for the microbenchmark.
Background
Most executions do not yield any result in fuzzing. However, afl-fuzz still processes the coverage in two passes. The first pass classifies the raw counters into bitmaps (see run_target), and the second pass scans for new program behavior (has_new_bits).

The first pass's output, i.e., the classified trace_bits, is seldom used beyond has_new_bits. The memory writes to the bitmap are wasted in most cases.

The second pass involves control flow transfers (ret) and side effects (updates to virgin). Therefore, compilers cannot efficiently optimize the hot path with loop vectorization.
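For context, a heavily simplified sketch of these two passes follows. It uses byte-wise bucketing instead of AFL's 16-bit lookup table, and MAP_SIZE_SKETCH, bucket8, and the function names are illustrative rather than afl-fuzz's actual identifiers:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAP_SIZE_SKETCH 65536          /* illustrative map size */

typedef uint8_t  u8;
typedef uint64_t u64;

/* Simplified hit-count bucketing (same buckets as AFL's count_class_lookup8). */
static u8 bucket8(u8 x) {
  if (x == 0)   return 0;
  if (x == 1)   return 1;
  if (x == 2)   return 2;
  if (x == 3)   return 4;
  if (x <= 7)   return 8;
  if (x <= 15)  return 16;
  if (x <= 31)  return 32;
  if (x <= 127) return 64;
  return 128;
}

/* Pass 1: classify the raw counters in place, after every execution. */
static void classify_counts_sketch(u8 *trace_bits) {
  for (size_t i = 0; i < MAP_SIZE_SKETCH; i += 8) {
    u64 word;
    memcpy(&word, trace_bits + i, 8);
    if (!word) continue;                              /* most words are zero */
    for (size_t j = 0; j < 8; j++)
      trace_bits[i + j] = bucket8(trace_bits[i + j]); /* write-back that is rarely needed later */
  }
}

/* Pass 2: re-read the classified map and check it against the virgin map. */
static int has_new_bits_sketch(const u8 *trace_bits, u8 *virgin_map) {
  int ret = 0;
  for (size_t i = 0; i < MAP_SIZE_SKETCH; i += 8) {
    u64 cur, vir;
    memcpy(&cur, trace_bits + i, 8);
    memcpy(&vir, virgin_map + i, 8);
    if (cur & vir) {                                  /* some bits are still virgin */
      ret = 1;
      vir &= ~cur;                                    /* side effect: shrink the virgin map */
      memcpy(virgin_map + i, &vir, 8);
    }
  }
  return ret;
}
```

The wasted write-back in pass 1 and the branching plus virgin updates in pass 2 are exactly the parts that block vectorization and that this PR removes from the hot path.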
Design

This pull request accelerates the most time-consuming part of AFL with a 3-tiered design. The lower tier is simpler, enabling more optimization opportunities. The higher tier is slower, but can handle more complicated cases.
For example, on CPUs with AVX512:
1. The fast path compares the trace bits with zero, one zmm (64 bytes) at a time, and skips the all-zero regions that dominate in practice.
2. When a zmm contains nonzero bytes, the comparison mask is used to locate the nonzero u64s inside the zmm. The nonzero parts are read, classified, and compared with the virgin map (see the sketch after this list).
3. When a difference against the virgin map is found, the slow path falls back to the original logic for the rest of the current zmm. When the next zmm is ready (we cannot switch immediately because of memory alignment requirements), the slow path switches back to the fast path.
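A rough sketch of how the fast and medium tiers could look with AVX-512 intrinsics, assuming a 64-byte-aligned map whose size is a multiple of 64; classify_word and slow_path are hypothetical helpers standing in for the real classification and fallback code, and this is not the PR's actual implementation:

```c
/* Compile with -mavx512f. */
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t  u8;
typedef uint64_t u64;

/* Hypothetical helpers assumed to exist elsewhere in this sketch. */
u64  classify_word(u64 word);                       /* bucket the eight counters in `word` */
void slow_path(u8 *trace, u8 *virgin, size_t off);  /* original two-pass logic for this region */

/* Scan `map_size` bytes (64-byte aligned, multiple of 64) for new coverage. */
static void scan_avx512(u8 *trace_bits, u8 *virgin_map, size_t map_size) {
  for (size_t i = 0; i < map_size; i += 64) {
    __m512i vec = _mm512_load_si512((const void *)(trace_bits + i));

    /* Fast path: one mask bit per u64 lane; all-zero blocks are skipped. */
    __mmask8 nonzero = _mm512_test_epi64_mask(vec, vec);
    if (!nonzero) continue;

    /* Medium path: visit only the nonzero u64s located by the mask. */
    while (nonzero) {
      int lane = __builtin_ctz(nonzero);
      nonzero &= nonzero - 1;

      u64 cur, vir;
      memcpy(&cur, trace_bits + i + lane * 8, 8);
      memcpy(&vir, virgin_map + i + lane * 8, 8);

      /* Classify in a register; trace_bits itself stays unclassified. */
      if (classify_word(cur) & vir) {
        /* Slow path: new bits found, fall back to the original logic. */
        slow_path(trace_bits, virgin_map, i);
        break;
      }
    }
  }
}
```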
Implementation

The first commit refactors base coverage-processing routines into two new files. The 32-bit and 64-bit versions are separated for clarity.
The second commit disables classification in the hot path. For save_if_interesting, the classification of trace bits is disabled. For other cold functions which obtain coverage, the classification is preserved.

The last commit completes the hot path with acceleration. A new function, has_new_bits_unclassified, is introduced. This function behaves like classify_counts + has_new_bits, with redundant computation removed and vectorization added.
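As an illustration of the combined check, here is a portable (no-SIMD) sketch of the same idea, reusing the typedefs and the bucket8 helper from the earlier sketch; the real has_new_bits_unclassified differs in its details:

```c
/* One-pass scalar check: skim zero words, classify in a register copy, and
   test against the virgin map without ever writing trace_bits back. */
static int has_new_bits_unclassified_sketch(const u8 *trace_bits,
                                            const u8 *virgin_map,
                                            size_t map_size) {
  for (size_t i = 0; i < map_size; i += 8) {
    u64 cur, vir;
    memcpy(&cur, trace_bits + i, 8);
    if (!cur) continue;                       /* hot path: zero words are skipped */

    /* Classify the eight counters in a register copy; trace_bits is untouched. */
    u8 bytes[8];
    memcpy(bytes, &cur, 8);
    for (int j = 0; j < 8; j++) bytes[j] = bucket8(bytes[j]);
    memcpy(&cur, bytes, 8);

    memcpy(&vir, virgin_map + i, 8);
    if (cur & vir) return 1;                  /* new bits: caller falls back to the full logic */
  }
  return 0;
}
```

Even this scalar form reads the trace bits only once per word and never writes them back, which matches the earlier observation that the gain persists without SIMD support.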
Result

Configuration: Clang 11.0.0, Xeon Gold 6148 (SKX)
Simulation: 30 inputs with new coverage, each repeated 25,614 times (in my case, the discovery rate is 1 new input per 25,614 executions)
µarch Analysis